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We consider extension of Granger causality to nonlinear bivariate time series. In this frame, if 
the prediction error of the first time series is reduced by including measurements from the second 
time series, then the second time series is said to have a causal influence on the first one. Not all 
the nonlinear prediction schemes are suitable to evaluate causality, indeed not all of them allow 
to quantify how much the knowledge of the other time series counts to improve prediction error. 
We present a novel approach with bivariate time series modelled by a generalization of radial basis 
functions and show its application to a pair of unidirectionally coupled chaotic maps and to a 
(— i , physiological example. 

ctf ; 

1 I. INTRODUCTION 

C0 ■ 

Identifying causal relations among simultaneously acquired signals is an important problem in computational time 
, series analysis and has applications in economy [1-2], EEG analysis Q, human cardiorespiratory system jj], interaction 
between heart rate and systolic arterial pressure and many others. Severalpapers dealt with this problem relating 
it to the identification of interdependence in nonlinear dynamical systems 6], or to estimates of information rates 
0,0- Some approaches modelled data by oscillators and concentrated on the phases of the signals 0. One major 
approach to analyze causality between two time series is to examine if the prediction of one series could be improved 
by incorporating information of the other, as proposed by Granger pj in the context of linear regression models of 
stochastic processes. In particular, if the prediction error of the first time series is reduced by including measurements 
from the second time series in the linear regression model, then the second time series is said to have a causal influence 
on the first time series. By exchanging roles of the two time series, one can address the question of causal influence 
, in the opposite direction. It is worth stressing that, within this definition of causality, flow of time plays a major role 
in making inference, from time series data, depending on direction. Since Granger causality was formulated for linear 
models, its application to nonlinear systems may not be appropriate. The question we address in this paper is: how 
is it possible to extend Granger causality definition to nonlinear problems? 
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C/2 \ In the next section we review the original approach by Granger while describing our point of view about its nonlinear 
extension; we also propose a method, exploiting radial basis functions, which fulfills the requirements a prediction 
scheme should satisfy to analyze causality. In section l|lll|) we show application of the proposed method to simulated 
and real examples. Some conclusions are drawn in Section IjlVp . 



II. GRANGER CAUSALITY 



» I 1 A. Linear modelling of bivariate time series. 

a i 

We briefly recall the Vector AutoRegressive (VAR) model which is used to define linear Granger causality [lj. Let 
and {yi\i=i..,N be two time series of N simultaneously measured quantities. In the following we will 
assume that time series are stationary. For k = 1 to M (where M = N — m, m being the order of the model), we 
denote x k = x k+m , y k = y k + m , X fc = (x k + m -i, %k+m-2, —, Xk), Y fc = [yk+m-i, Vk+m-i, —, Vk) and we treat these 
quantities as M realizations of the stochastic variables (x, y, X, Y). The following model is then considered \UJ^ : 

x = W 11 -X + W 12 -Y, 

y = W 21 -X + W 22 -Y, W 
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{W} being four m-dimensional real vectors to be estimated from data. Application of least squares techniques yields 
the solutions: 

Wn\ ( E xx S I9 VVt u 
W 12 J \ T, yx Y> yy ) \ T12 



and 

\ W22 j \ — uj y v T22 

where £ matrices and T vectors are the estimates, based on the data set at hand, of the following average values: 
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Let us call e xy and e yx the prediction errors of this model, defined as the estimated variances of x — Wn • X — W 12 • Y 
and y — W21 ■ X — W 22 • Y respectively. In particular 



xv = & Ek=i(x k ~ W U • X k W 12 • Y fc ) 2 ; 
ev* - Jj T.lUv k ~ W 21 ■ X fc - W 22 ■ Y fe ) 2 



We also consider AutoRegressive (AR) predictions of the two time series, i.e. the model 

x = Vi • X, , . 

y = v 2 ■ Y. w 

In this case the least squares approach provides Vi = E^Tn and V 2 = S !/a 1 T 22 . The estimate of the variance of 
x — Vi • X is called e x (the prediction error when x is predicted solely on the basis of the knowledge of its past values) ; 
similarly e y is the variance of y — V 2 ■ Y. If the prediction of x improves by incorporating the past values of {?/;}, i.e. 
e xy is smaller than e x , then y has a causal influence on x. Analogously, if e yx is smaller than e y , then x has a causal 
influence on y. Calling c\ = e x — e xy and ci = e y — e yx , a directionality index can be introduced: 

D= C -^. (5) 

Cl + C 2 

The index D varies from 1 in the case of unidirectional influence (x — > y) to — 1 in the opposite case (y —> x), with 
intermediate values corresponding to bidirectional influence. According to this definition of causality, the following 
property holds for N sufficiently large: if Y is uncorrelated with X and x, then e x — e xy . Indeed in this case 
Y, xy = Sya, = and T 12 = 0, therefore W 12 = 0. This means that VAR and AR modelling of the {xi} time 
series coincide. Analogously if X is uncorrelated with Y and y, then e y — e yx . It is clear that these properties are 
fundamental and make the linear prediction approach suitable to evaluate causality. On the other hand, for nonlinear 
systems higher order correlations may be relevant. Therefore, we propose that any prediction scheme providing a 
nonlinear extension of Granger causality should satisfy the following property: (PI) if Y is statistically independent 
of X and x, then e x = e xy ; if X is statistically independent of Y and y, then e y = e yx . In a recent paper 
use of a locally linear prediction scheme ^1 has been proposed to evaluate nonlinear causality. In this scheme, the 
joint dynamics of the two time series is reconstructed by delay vectors embedded in an Euclidean space; in the delay 
embedding space a locally linear model is fitted to data. The approach described in satisfies property PI only 
if the number of points in the neighborhood of each reference point, where linear fit is done, is sufficiently high to 
establish good statistics; however linearization is valid only for small neighborhoods. It follows that this approach to 
nonlinear causality requires very long time series to satisfy PI. In order to construct methods working effectively with 
moderately long time series, in the next subsection we will characterize the problem of extending Granger causality 
as the one of finding classes of nonlinear models satisfying property PI. 
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B. Nonlinear models. 

What is the most general class of nonlinear models which satisfy PI? The complete answer to this question is 
matter for further study. Here we only give a partial answer, i.e. the following family of models: 

x = wn • *(X) +w 12 • *(Y) , , . 

y = W2i ■ * (X) + w 22 • * (Y) , W 

where {w} are four n-dimensional real vectors, 3? = (<pi, tp n ) are n given nonlinear real functions of m variables, 
and SI/ = ...,ip n ) are n other real functions of m variables. Given ffr and SI/, model © is a linear function in 
the space of features <p and ip; it depends on in variables, the vectors {w}, which must be fixed to minimize the 
prediction errors 



**» = 17 ElUx k ~ wu • * (X fe ) w 12 * (Y*)) 2 ; 
ty* = ti ElUy k - w al • # (X fc ) w 22 * (Y fc )) 



We also consider the model: 

X = V! * (X) , 

y = v 2 • * (Y) , W 

and the corresponding prediction errors e x and e y . 

Now we prove that model 10 satisfies PI. Let us suppose that Y is statistically independent of X and x. Then, 
for each p = 1, .., n and for each A = 1, .., n: ip^ (Y) is uncorrelated with x and with tp\ (X). It follows that 

variance [x — Wu • <& (X) — wi 2 • SI/ (Y)] = variance [x — wu • $ (X)] + variance [wi 2 • SI/ (Y)] . (9) 

As a consequence, for large N, at the minimum of e xy one has Wi 2 = 0. The same argument may be used exchanging 
x and y. This proves that PI holds. 

The solution of least squares fitting of model JfJJl to data may be written in the following form: 



Wu 
W12 



W 2 1 
W 2 2 



= (Si S 2 ) + ti, 



= (S 2 Si^ta, 



where t denotes the pseudo-inverse matrix [l3l | ; S matrices and t vectors are given by: 

[Si] fcp = ip p (X fe ) k = l,...,M,p=l,...,n 



[S 2 ] fcp = ^p(Y fe ) k=l,...,M,p=l,...,n nm 
[ti] k = x k k = l,...,M 1 UJ 

[t 2 ] k = y k k = l,...,M 

Solution of model JHJ is given by vi = Si^ti and V2 = S2^t2. 

C. Radial basis functions. 

Radial basis functions (RBF) methods were initially proposed to perform exact interpolation of a set of data points 
in a multidimensional space (see, e.g., subsequently an alternative motivation for RBF methods was found 

within regularization theory RBF models have been used to model financial time series |l6|. 

In this subsection we propose a strategy to choose the functions $ and SI/, in model (JSJl, in the frame of RBF 
methods. Fixed n <C M, n centers {X p }^ =1 , in the space of X vectors, are determined by a clustering procedure 
applied to data {X Analogously n centers {Y p }™ =1 , in the space of Y vectors, are determined by a clustering 

procedure applied to data {Y k } k vr =1 . We then make the following choice: 

¥> P (X) =expf-||X-X p || 2 /2^ p=l,...,n, 

V-p(Y) =exp(-||Y-Y p || 2 /2a 2 ) p=l,...,n, " i! 
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a being a fixed parameter, whose order of magnitude is the average spacing between the centers. Centers {X p } are the 
prototypes of X variables, hence ip functions measure the similarity to these typical patterns. Analogously, ip functions 
measure the similarity to typical patterns of Y. Many clustering algorithm may be applied to find prototypes, for 
example in our experiments we use fuzzy c- means |17| . 

Some remarks are in order. First, we observe that the models described above may trivially be adapted to handle 
the case of reconstruction embedding of the two time series in a delay coordinate space, as described in ^l|- Second, 
we stress that in (JSJ) x and y are modelled as the sum of two contributions, one depending solely on X and the other 
dependent on Y. Obviously better prediction models for x and y exists, but they would not be useful to evaluate 
causality unless they would satisfy PI. This requirement poses a limit to the level of detail at which the two time 
series may be described, if one is looking at causality relationships. The justification of the model we propose here, 
based on rcgularization theory, is sketched in the Appendix. 



D. Empirical risk and generalization error. 

In the previous subsections the prediction error has been identified as the empirical risk, although there is a difference 
between these two quantities as Statistical Learning Theory (SLT) shows. The deep connection between empirical 
risk and generalization error deserves a comment here. First of all we want to point out that the ultimate goal of 
a predictor and in general of any supervised machine x = /(X) fUl] is to generalize, that is to correctly predict 
the output values x corresponding to never seen before input patterns X (for definiteness we consider the case of 
predicting x on the basis of the knowledge of X). A measure of the generalization error of such a machine / is the 
risk R[f] defined as the expected value of the loss function V (x, /(X)): 

R[f] = J dxdX V(x,f(X))P(x,X), (12) 

where P(x, X) is the probability density function underlying the data. A typical example of loss function is 
V (x, /(X)) = (x - f(X)) z and in this case the function minimizing R[f] is called the regression function. In gen- 
eral P is unknown and so we can not minimize the risk. The only data we have are M observations (examples) 
S = {(x k ,X k )}^ I =1 of the random variables x and X drawn according to P(x,X). Statistical learning theory |l£| as 
well as regularization theory fill ] provide upper bounds of the generalization error of a learning machine /. Inequalities 
of the following type may be proven: 

R[f]<e x + C, (13) 

where 

A I 



-E 

M ^ 

k=l 



'x k 



f(X k )) (14) 



is the empirical risk, that measures the error on the training data. C is a measure of the complexity of machine / 
and it is related to the so-called Vapnik-Chervonenkis (VC) dimension. Predictors with low complexity guarantee low 
generalization error because they avoid overfitting to occur. When the complexity of the functional space where our 
predictor lives is small, then the empirical risk is a good approximation of the generalization error. The models we 
deal with in this work verify such constraint. In fact, linear predictors have a finite VC-dimension, equal to the size of 
the space where the input patterns live, and predictors expressed as linear combinations of radial basis functions are 
smooth. In conclusion empirical risk is a good measure of the generalization error for the predictors we are considering 
here and so it can be used to construct measures of causality between time series pcj . 



III. EXPERIMENTS. 



In order to demonstrate the use of the proposed approach, in this section we study two examples, a pair of 
unidirectionally coupled chaotic maps and a bivariate physiological time series. 
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A. Chaotic maps. 



Let us consider the following pair of noisy logistic maps: 



x n +\ = a x n (1 - x n ) + sr) n+ i, 

y n+ i = e a y„ (1 - y n ) + (l-e)oi n (1 - %n) + s£n+i! 



(15) 



{77} and {£} are unit variance Gaussianly distributed noise terms; parameter s determines their relevance. We fix 
a = 3.8, and e G [0, 1] represents the coupling x — > y. In the noise-free case (s = 0), a transition to synchronization 
(x n — y n ) occurs at e = 0.37. We evaluate the Lyapunov exponents by the method described in |2lj : the first 
exponent is 0.43, the second exponent depends on e and is depicted in Fig. 1 for e < 0.37 (it becomes negative for 
e > 0.37). For several values of e, we have considered runs of 10 5 iterations, after 10 5 transient, and evaluated the 
prediction errors by © and (JHJ), with m = 1, n = 100 and a = 0.05. In fig. 2a we depict, in the noise free case, the 
curves representing c\ and c-i versus coupling e. In figures 2b, 2c and 2d we depict the directionality index D versus e, 
in the noise free case and for s = 0.01 and s = 0.07 respectively. In the noise free case we find D = 1, i.e. our method 
revealed unidirectional influence. As the noise increases, also the minimum value of e, which renders unidirectional 
coupling detectable, increases. 



As a real example, we consider time series of heart rate and breath rate of a sleeping human suffering from sleep 
apnea (ten minutes from data set B of the Santa Fe Institute time series contest held in 1991, available in the Physionet 
data bank There is a growing evidence that suggests a causal link between sleep apnea and cardiovascular disease 

[i^l, although the exact mechanisms that underlie this relationship remain unresolved [24|]. Figure 3 clearly shows 
that bursts of the patient breath and cyclical fluctuations of heart rate are interdependent. We fix m — 1 and n = 50; 
varying a we find that both e x [x representing heart rate) and e y (y representing breath) have a minimum at a close 
to 0.5. In fig. 4 we depict the directionality index D vs a, around a = 0.5. Since we find D positive, we may conclude 
that the causal influence of heart rate on breath is stronger than the reverse [2j|. This data have been already 
analyzed in [7J, measuring the rate of information flow (transfer entropy), and a stronger flow of information from the 
heart rate to the breath rate was found. In this example, the rate of information flow entropy and Granger nonlinear 
causality give consistent results: both these quantities, in the end, measure the departure from the generalized Markov 
property Q 



The components of complex systems in nature rarely display a linear interdependence of their parts: identification 
of their causal relationships provides important insights on the underlying mechanisms. Among the variety of methods 
which have been proposed to handle this important task, a major approach was proposed by Granger [lj. It is based 
on the improvement of predictability of one time series due to the knowledge of the second time series: it is appealing 
for its general applicability, but is restricted to linear models. While extending Granger approach to the nonlinear 
case, on one hand one would like to have the most accurate modelling of the bivariate time series, on the other hand 
the goal is to quantify how much the knowledge of the other time series counts to reach this accuracy. Our analysis is 
rooted on the fact that any nonlinear modelling of data, suitable to study causality, should satisfy the property PI, 
described in Section (|TTI) . It is clear that this property sets a limit on the accuracy of the model; we have proposed a 
class of nonlinear models which satisfy PI and constructed an RBF like approach to nonlinear Granger causality. Its 
performances, in a simulated case and a real physiological application, have been presented. We conclude remarking 
that use of this definition of nonlinear causality may lead to discover genuine causal structures via data analysis, but 
to validate the results the analysis has to be accompanied by substantive theory. 



B. Physiological data. 



P(x I X) 
P(y I Y) 



P(x\X,Y), 

^(y|x,Y). 



(16) 



IV. CONCLUSIONS. 
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V. APPENDIX. 



We show how the choice of functions arise in the frame of regularization theory. Let z be a function of X and 
Y. We assume that z is the sum of a term depending solely on X and one depending on Y: z(X, Y) = /(X) + g(Y). 
We also assume the knowledge of the values of / and g at points {X p , Y p } p= i j .. irl : 

f(XP) = f p P=h-,n, . 7) 

g(Y?) = g» p=l,...,n. { ' 

Let us denote K(uj) the Fourier transform of K(r) = exp(— r 2 /2cr 2 ). The following functional is a measure of the 
smoothness of z(X,Y): 

«H-/« M , (18) 

Indeed it penalizes functions with relevant contributions from high frequency modes. Variational calculus shows that 
the function that minimize S under the constraints l|17|) is given by: 

n n 

z = ^ Mp ^(x-X p )+^A p if (Y-Y"); (19) 
P =i P =i 

where {^} and {A} are tunable Lagrange multipliers to solve JT7J|. Hence model (JBJ-JTJl corresponds to the class of 
the smoothest functions, sum of a term depending on X and a term depending on Y, with assigned values on a set 
of n points. 
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FIG. 1: The second Lyapunov exponent of the coupled maps il l oft is plotted versus coupling e. 
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FIG. 2: (a) For the noise free case of coupled maps 1150 . ci = e x — e xy (dashed line) and C2 = t y — e yx (solid line) are plotted 
versus coupling e. (b) The directionality index D (see the text) is plotted versus e in the noise free case, (c) The directionality 
index D is plotted versus e, s = 0.01. (d) D is plotted versus e, s — 0.07. 
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FIG. 3: Time series of the heart RR (upper) and breath signal (lower) of a patient suffering sleep apnea. Data sampled at 2 
Hz. 
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FIG. 4: The directionality index D is plotted versus a for the physiological application, around a — 0.5. Solid line is the 
3t/i-polynomial best fit of points, here shown only for illustrative purposes. 



